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Objectives 


• Simulate Compressible Navier-Stokes Flow 
About General Geometries 

— Unstructured (Simplicial) Meshes 
— Stabilized Numerical Space Discretizations 

- Exact and Approximate Newton Iterations 

• Obtain Parallel Scalability and Efficiency 

— Schur Complement Domain Decomposition 
— ILU + GMRES on Subdomain Problems 

- Emphasis on Coarse Grain Parallism 

♦ SGI 0rigin2000 (64 processors) 

♦ SGI Array (40 processors) 

♦ CRAY J90 Cluster (20 processors) 
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Some Difficult Flow Problems 



Entrance/Exit Flow 


Recirculation Flow 


Boundary- layer Flow 
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Advection Dominated Model Problems 


iSS 


• 20 40 40 


GMRESMsufe- Vector IMtiffes 


u x + u v = 0 


yu x - XUy = Ijn £aUu 


Convergence of ILU+GMRES for Cuthill-McKee 
ordered matrices produced from scalar SUPG spatial 
discretization. 
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Stabilized Numerical Methods 
Advection Dominated Flows 


Model Equation: 


Ut+X * Vti = c A u, (a;, t) € ft x [0, T] 


with 


ti(x, 0) = tio(x), xGft 

u(x,t) = x € r 



• Stabilized F.E.M.: (Johnson, Hughes, et. al.) 
Find u€ Sh such that V w G V/» 


wutdCl + l tu(A*Vu)dft + / c (Vtu • Vti) d ft 


+/ (A • Vti — eAti) r (A • 10 — cAw) d ft 

Jn 

+ Discontinuity Capturing Operator = 0 
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ELF 

(an Element Library for Fluids) 


Euler equations in divergence form: 
ut + /(u)x + 9(u) y = 0, f(u),g(u) : R m i-> R m 

Euler equations in symmetric quasilinear form 
u »-> v: 

^Ao^Vt-\- AA qv x + BAo Vy = 0, Aq,A, B € R mxm 

SPD aymm 

ELF Methodology: Galerkin Least-Squares in 
symmetric variables 

ELF Input: Simplex geometry, Simplex dofs. 

ELF Output: Jacobian matrix and rhs terms for 
global assembly. 





Barth, Chan, Ting f 1 


Newton’s Method 


Fluid flow equations in semi-discrete form: 


Du t = R(u), R(u ) : R" -4 R" 


Backward Euler time integration: 




Let At vary with spatial position and ||ii(ti)|| so 
that Newton’s method is approached as 

ll-R( ti )ll o. 
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Solving A x = b 


Let 


Ax — b = 0 


represent a matrix problem taken from one step 
of Newton’s method. 

Solve using flexible GMRES in right 
preconditioned form: 

{AM- l )Mx-b = 0. 

• Allows nonstationary preconditioning 0^) 


Ideally 


k(AM~ 1 )=0( 1). 
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Preconditioners 


Some candidate preconditioners: 

1. ILU factorization 


- 

- -J » 


— Nonoptimal 
— Not easily parallized 

2. Additive Schwarz variants of ILU on 
overlapping meshes 

— Nonoptimal without coarse space correction 
- 3-D tetrahedral coarsening is problematic 

3. Multigrid 

— Similar difficulties as additive Schwarz 

•*’’ vov\ 7 

4. Nonoverlapping Schur domain decomposition 

— Data locality well* suited to coarse grain 
parallel architectures 

— Algebraic coarse space provides global 
coupling 
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Nonoptimality of ILU Preconditioning 


I ■: 


Advecttoo Dominated Problem 




! : \: 



GMRES Matrix- Vector MakipUa* 


GMRES Matrix -Vector Multiplies 


Figure 1-2. Performance of ILU for diffusion and 
advection dominated problems using scalar SUPG 
discretization and CuthilLMcKee ordering. 


ILU retains O (^r) condition number for 
elliptic problems. 

- Use of modified ILU variants is 
problematic in the nonsymmetric 
(advection dominated) limit. 


Barth, Chan, Ting (1938) 



Barth, Chan. T,\ng (1998) 








Additive Schwarz Preconditioned GMRES 

(Prom Barth, AGARD R-807, 1996) 


r 


«■ 10 
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100 200 300 400 

GMRES Matrix- Vector Products 


Effect of increased overlap and ILU fill. 


fcr® 


«■ io 


a t 


0 SO 100 ISO 200 
GMRES Matrix-Vector Products 


Effect of increasing the number of subdomains. 


— Inviscid flow about multi-element airfoil (6k vertices) 


— 1 sweep overlapped ILU with no coarse space correction 
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Schur Complement Domain Decomposition 

(Some Contributors) 


Basic Approach: Axelsson, Glowinski, Lions, Mandel, 
Nepomyaschikh, P6riaux, Przemieniecki, Widlund, Xu 

Interface Preconditioners: Bjorstad, Bramble, Chan, 
Keyes, Golub, Gropp, Mayers, Pasciak, Schatz, Smith 

" Parallel Implement At ion Issues: Chan. Keyes, Gropp, 
Smith 

See Also: Domain Decomposition: Parallel 
Multilevel Methods for Elliptic PDEs, B. Smith 
and P. Bjorstad and W. Gropp, Cambridge Press, 
1996 . 
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Nonoverlapping Domain Decomposition 
via Schur Complement 


12 1 













mm 




Domain decomposition and induced 2x2 matrix 
partitioning. 
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Solving a 2 x 2 Block Matrix 

Subdomain partition as illustrated. 

Order the subdomain variables before the 
interface variables. 


2x2 form of the system 

An Aib 


Ax = 


Abi Abb 


](:)=(£)' 


xi, x B denote the interior and the boundary 
variables. 

Block LU factorization of A: 

A = LU = 


A„ O' 

| I Ajl Aib 

AbT /. 

! 0 5 

m 


Eliminate xb from the reduced equations: 

Sxb = Sb — AbiAJj fi 


• Solve for xi by 

xi = AJj(fi — Aibxb ), 

• S = Abb — AbiAJj Aib is the Schur 
Complement of Abb in A. 
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Exact Schur Complement Preconditioning 

Preprocessing Step: Calculate the Schur 
complement 

S = Abb — ^.(ABi)i(An)~ 1 (AiB)i- 


Preconditioning Step: Solve Mx = r 


Step (1) 
Step (2) 
Step (3) 
Step (4) 
Step (5) 
Step (6) 


(Aifil 1 r i 

(Asi)i 

- r b - 52 v % 

S- X w b 
(AlB)i X b 
Ui - (An)i 1 Vi 


(parallel) 

(parallel) 

(comm) 

(s-parallel,comm 

(parallel) 

(parallel) 
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Scalability Experiments 
Fixed Subdomain Size, 4 x Partitions: 
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Interface Decomposition 




Overview (top) and closeup (bottom) of interface 
decomposition obtained via partitioning of the Schur 
complement supernode graph (64 subdomains, 4 
interfaces). 
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Global Preconditioning Strategy for Ax = b 

• Partition subdomains and interfaces 

• Assign a processor to each subdomain and 
interface component 

• Precondition subdomain problems using 
processor-local ILU 

• Precondition the Schur complement using 
drop-fiUed,processor-k>cal ILU factorizations 

• Implement flexible GMRES in a domain 
decomposed parallel environment 

- Parallel dot products 
— Parallel matrix- vector products 


w 
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Preconditioner I: Inexact Subproblem Solves 

Preprocessing Step: Calculate the approximate 
Schur complement 

S = Abb — f (ABi)i(Aii)J 1 (AiB)j . 


1. (A//)“ (A/a)i -> (A//)~ (A/B)i using mi steps 
of ILU+GMRES so that S ^ S. 

2. (A//)^ Zi Ajj Zi using mj steps of 
ILU+GMRES. 

3. 5 u;6 — ► 5 tut using m3 steps of 

ILU+GMRES. 

Preconditioning Step: Solve. Mx = r 


Step (1) 
Step (2) 
Step (3) 
Step (4) 
Step (5) 
Step (6) 


(Aii) { 1 n 

(Abi)i tit 

■ n-Y^Vi 

S~ l wt, 

(A/B)i Xb 

- 

Ui - ( Au)i yi 


(parallel) 

(parallel) 

(comm) 

(s-parallel,comm) 

(parallel) 

(parallel) 
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Iterations Required 


Preconditioner I: Performance 


Diffusion Dominated 


150 


100 


15k dofs (1 subdomain) 

10kdofs(l subdomain) 

— 40k dob (1 subdomaiu) 
15k dofs (4 subdomalM) 
• 10k dofs (16 subdoualw) 

-O- 40k dofs (64 subdomains) 


vj '• 


m_l=m_2=m_3 

— Scalar SUPG discretization with linear elements 


— Stopping criterion: ||Ax n — 6|| < 10“ 8 ||j4x o — &D 
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Iterations Required 


120 



2.5k dots (t MkdoMria) 

10k doft (1 nMomIi) 

40k dots (I mbdMMift) 
2.5k doft (4 «ub Jowiii) 
10k doft (16 mhdomii— ) 
-o- 40k doft (64 mb Joniai) 


% i 

AuUMMMM 

A % * \ 


m_l=m__2=m_3 


— Scalar SUPG discretization with linear elements 


— Stopping criterion: ||j4at n — 6|| < 10“ 8 ||j4aPo — 6|| 
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Observations for Preconditioner I 


• There exists an optimal number of subdomains 
for minimizing solution time. 

• Replacing a single long recurrence ILU 
factorization with several shorter recurrences 
coupled via Schur complement yields an 
improved quality preconditioner. 

• Choosing small values of the parameters mi, m2, 
■%;- and m3 < 3 leads to iniaiipum CPU times, (our 

experience). 

• Linear growth in the time needed to form the 
preconditioner is avoided by further partitioning 
of the interface. How to precondition the 
parallelized interface? Processor-Local ILU. 


Barth, Chan, Tang (1998) 



Preconditioner II: Drop Tolerance Approximation 

Drop elements of the Schur complement matrix, Siji 

(1) |5tj | < tol (scalar matrix entries) 

(2) Sparsity pattern (block matrix entries) 

— For discretized elliptic problems S exhibits faster 

element decay than (An )r* (Golub and Mayers). 

Preprocessing Step: Calculate the approximate 
Schur complement 

S = Abb — y^(AB/)i(Ajj)7 1 (A/B)» 

t 

- Precondition S with ILU (DROP(S)) 
Preconditioning Step: Solve Mx = r 


Step (1) 
Step (2) 
Step (3) 
Step (4) 
Step (5) 
Step (6) 


( A //)," 1 n 

(ABl)iUi 

r b -Ys v ' 

S- 1 w b 
(Aib)* x b 

ti» - (Ajj^yi 


(parallel) 

(parallel) 

(comm) 

(s- parallel, comm) 

(parallel) 

(parallel) 
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Preconditioner III: Wireframe Approximation 

• Main cost in Preconditioner II is the formation of 
S itself: one subdomain solve is needed for every 
variable on the interface. 

• Idea: Approximate S by the Schur complement of 
a relatively thin wireframe region surrounding the 
interface variables. 

• Take a principal submatrix of 

A: variables within the wire- 
frame. _ 


. WkiiiifitiHii 


Compute the - approxi- 
mate Schur complement of the 
interface variables in this prin- 
cipal submatrix instead of A 
itself. 


Wireframe LocaHzed 


e\\in£lc- 

From DD theory: the exact Schur complement of 
wireframe region is spectrally equivalent to the 
Schur complement of the whole domain. 
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Preconditioner III: Wireframe Approximation 

Pure Diffusion Problem: 


Support 


2 


3 


Mesh 

Sise 

Num 

Procs 

1600 

16 

1600 

16 

1600 

16 



Time 

Time 

Iter 

Form* 

Apply 

r— ■ 1 

15 


.11 

11 

.017 

mm 

6 

.020 

.04 


Pure Advection Problem: 


Mesh 

Size 

Num 

Procs 

1600 

16 

1600 

16 

1600 

16 


Support 


2 


3 ~ - 


Iter 


8 


& 


Time 

Form* 

Time 

Apply 

.012 

.09 

.017 

.06 

.019 

.03 


— Scalar SUPG discretization with linear elements 

— Stopping criterion: ||A* n — 6 || < 10” 9 ||A*o — &|| 

— No element dropping, mi = 2, mj = 2, m 3 = 2 
* Timing based on parallelized wireframe 
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Preconditioner III: Wireframe Approximation 




• Introduces a new adjustable parameter into the 
preconditioner. 

• Specify the width of the wireframe strip in terms 
of graph distance on the mesh triangulation. 

• Quality of the preconditioner improves rapidly 
with increasing wireframe width. 

• Time taken form the Schur complement is 
reduced by approximately 50%. 

• Further improvements in cost/performance: 
choosing the shape of the wireframe to better 
represent the PDE being solved, viz. the flow 
direction. 
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Preconditioner III: Wireframe Approximation 

• Preprocessing step: approximate Schur 
complement on tlie localised subdomains: 

S = Abb — (ABi)i(Au)^ 1 (AiB)i 

t 

— Axy' restrictions of Ah^Aib^Abi to the 
localized Schur subdomain(s). 

• The preconditioning step (solving Mx = r) is 
now: 


Step (1) 
Step (2) 
Step (3) 
Step (4) 
Step (5) 
Step (6) 


(An)i 1 n 

(Asi)i Ui 

: r b - ^2 Vi 
S~ x Wb 
(AiB)iXb 
Ui - (An )^ 1 yi 


(parallel) 

(parallel) 

(comm) 

(s-parallel,comm) 

(parallel) 

(parallel) 
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Preconditioner IV: Supersparse Approximation 
• Observations: 


1. Iterative calculation of AJjAjb is expensive. 

2. Columns of Aib are usually very sparse. 

3. Unpreconditioned Krylov subspace methods 
require computation of the vector sequence 
[p, Ap, A 2 p } .... . , A m p\ y p = col(B) for small m. 

4. ILU preconditioning destroys sparsity of the 
preconditioned Krylov subspace vectors 

[p, M~ x Ap , (AT 1 A) 2 p , . . . , (M" 1 A) m p]. 
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Preconditioner TV: Supersparse Approximation 
• Strategy: 

1. Eliminate wasted flop’s in matrix- vector 
products by storing vectors and matrices in 
sparse form. 

x x 

X 

X 

X X X 

X 

2. Approximate the ILU backsotves based on a 
vector fill level strategy to maintain sparsity. 

A x*b 
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Preconditioner IV: Supersparse Approximation 


• Results indicate preconditioning performance 
comparable to exact backsolves. 

• 60-70% reduction in cost. 

• This technique can be combined with the 
previous wireframe strategy with combined 5-7 
fold performance gains. 


Fill Level k I Global GMRES Iter I Time(fc)/Time(oo) 


0.299 


0.313 


0.337 


0.362 


.392 


1.000 



2-D test problem consisting of Euler flow past a 
multi-element airfoil geometry partitioned into 4 subdomains 
with 1600 mesh vertices in each subdomain. 
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0.0 23 SjO 7 3 IOjO 12J 

NouUomt hcrattoa 


0 10 20 30 40 30 60 

Global GMRES Matrix- Vector Prefects 


— Mach = .63, Angle of Attack = 2.0° 

— ELF Galerkin Least-Squares 

— Time Relaxed Newton Iteration 


— mi = 1 , m 2 = m 3 = 4, Fill Level=3 

— 4 Subdomain Mesh, (5k Cells/Subdomain) 


Barth, Chan, Tang (1998) 








Barth, Chan, Tang (1998) 









Barth, Chan, Tang (1998) 




Preliminary Timings 



Subdomain Interf=100 Subdomain Interf=150 Subdomain lnterf= 150-200 

(4 Interface Partitions) 


Raw Timings 


T~"T" 


Adjusted for Interface i 




1 T T 

! i i 

! i 


•of r — 1-^, 


t 


0 8 16 24 32 40 48 56 64 

Processors 

— 6 GMRES Iterations 


" ; *i | — -4- — f \ — -f- 


i f 


44 — t — — r 

......L. J — i-4- 


I i 

1 r 

i ! 

••4— t 

- 4-4 


0 8 16 24 32 40 48 

Processors 


56 64 


— 5 k Cells/Subdomain 

— IBM SP2, MPI Message Passing Protocol 

— mi = 1, mj = m 3 as 4, Fill Level=3 
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Performance of Schur Complement D.D 

For Viscous Flow 

*■ Hm * — «- 4 SuMocn^is PsoWoalng 



OntMl GMRES Matrix-Vector MakipUm 


Global GMRES Mavix-Vcctar Multiplies 


8 Subdomain Partiboeiaj 


16 Sabdomaia Pwfcioehg 




10 0 10 30 30 40 

10 0 10 20 30 

40 

Global GMRES Matrix-Vector Multiplies 

SUPG Discretization 

Global GMRES Matrix-Vector Multiplies 

Local CFL Number 100, 000 

4 n 1 \ \(o parVtMt’on % U j 

.‘VK w\es^ 

% r**e. 



j 
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Concluding Remarks 


• The baseline nonoverlapping scheme is very 
robust but relatively expensive 

• Localized wireframe and supersparse 
computations can significantly reduce cost of the 
forming the Schur complement preconditioner 

• Scalability: Growth of the Schur complement 
matrix necessitates partitioning of the interface 

• Scalability: Load balancing still problematic due 
to size imbalance of the interface associated with 
each subdomain 
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